#to double check if got use DT, scales, ggpubr
pacman::p_load(ggiraph, plotly, patchwork, DT, tidyverse,
knitr,scales,ggstatsplot,paletteer,wesanderson,
ggpubr,crosstalk,gganimate, ggridges, rstatix,gt,webshot2,png,nortest)
options(scipen = 999)Take Home Exercise 1
Visual Analytics of the demographic and financial characteristics of residents in City of Engagement
1. OVERVIEW
City of Engagement is a small city located at Country of Nowhere, with a total population of 50,000, serving as an agriculture region. The local council of the city is in the midst of preparing the Local Plan 2023.
1.1 The Task
In this take-home exercise, patterns containing demographics and financial characteristics of residents in City of Engagement will be unveiled using appropriate static and interactive statistical graphics methods.
2. Datasets
Data has been collected by the local council of the city. The survey sampled 1,000 respondents to collect data related to their household demographic, spending patterns and among, other things. The data is stored in two separate files ranging from Mar 2022 to Feb 2023 :
| 1) Participants.csv | 2) FinancialJourval.csv | |
|---|---|---|
| Rows | 1,011 | 1,513,636 |
| Variables | 7 | 4 |
2.1 Metadata
| File | Variables Name | Description |
|---|---|---|
participants.csv FinancialJourval.csv |
participantId | Unique identification to represent the Participants |
| participants.csv | householdSize | Represents the number of people in the household |
| participants.csv | haveKids | Binary value (True/False) indicating if participant have a kid(s) |
| participants.csv | age | Represents age of the participant |
| participants.csv | educationLevel | Represents the highest education attained by participant |
| FinancialJourval.csv | interestGroup | Represents the group associated with the participant |
| FinancialJourval.csv | joviality | Represents the level of happiness by participant at the start of the survey |
| FinancialJourval.csv | timestamp | Represents the date and time the entry was inputted |
| FinancialJourval.csv | category | Represents the type of income/expenses incurred at a given timestamp |
| FinancialJourval.csv | amount | Represents the amount received ( + income), amount paid ( - expenses ) |
3. Data Preparation
3.1 Install R-packages
Using p_load() of pacman package to load and install the following libraries:
ggiraph: For creating interactive ‘ggplot’ graphicsplotly: For creating interactive statistical graphspatchwork: For combining multiple ggplot2 graphs into one figuretidyverse: A collection of R packages use in everyday data analyses. It is able to support data science, data wrangling, and analysis.knitr: For dynamic report generationggstatsplot: For creating graphics with details from statistical tests included and its plotpaletteer: Collection of color paletteswesanderson: Wes Anderson’s theme Palette Generator
options(scipen = 999) : removes scientific notation in our exercise.
3.2 Import Data
3.2.1 Import participants dataset
participants <- read_csv("data/Participants.csv")3.2.2 Load participants
# A tibble: 6 × 7
participantId householdSize haveKids age educationLevel interestGroup
<dbl> <dbl> <lgl> <dbl> <chr> <chr>
1 0 3 TRUE 36 HighSchoolOrCollege H
2 1 3 TRUE 25 HighSchoolOrCollege B
3 2 3 TRUE 35 HighSchoolOrCollege A
4 3 3 TRUE 21 HighSchoolOrCollege I
5 4 3 TRUE 43 Bachelors H
6 5 3 TRUE 32 HighSchoolOrCollege D
# ℹ 1 more variable: joviality <dbl>
head(participants)3.2.3 Import Financial Journal dataset
financial_journal <- read_csv("data/FinancialJournal.csv")3.2.4 Load Financial Journal
# A tibble: 6 × 4
participantId timestamp category amount
<dbl> <dttm> <chr> <dbl>
1 0 2022-03-01 00:00:00 Wage 2473.
2 0 2022-03-01 00:00:00 Shelter -555.
3 0 2022-03-01 00:00:00 Education -38.0
4 1 2022-03-01 00:00:00 Wage 2047.
5 1 2022-03-01 00:00:00 Shelter -555.
6 1 2022-03-01 00:00:00 Education -38.0
head(financial_journal)3.3 Data Wrangling
As seen from the two data tables above, there are some issues that could be rectify. Henceforth, the following adjustments are made:
3.3.1 participants.csv :
participantId is a
<dbl>variable. (Rectify by reformatting it to<chr>)householdSize is a
<dbl>variable. (Revised to<ord>for the order of categories)age is a continuous variable which makes it harder to visualize the demographics (Create a new column with 5-class variables after determining the youngest and oldest demographics.)
Show the code
#check min and max age of residents in COE.
min(participants$age)[1] 18
Show the code
max(participants$age)[1] 60
educationLevel is a
<chr>variable. (Revised to<ord>for the order of categories)joviality has nine decimal places. (Rectify by rounding it to 2.d.p and create new 5-class variables for future analysis)
Show the code
#create new dataset
participants_new <- participants %>%
mutate(
participantId = as.character(participantId),
#binned joviality to 5-class variables
joviality_bins = cut(joviality, breaks = c(0.0,0.2,0.4,0.6,0.8,1.0))
)
#reformat householdSize to Ordinal
participants_new$householdSize <- factor(participants$householdSize,
levels = c("1", "2", "3"),
ordered = TRUE)
#reformat age group
participants_new$age_group <- factor(ifelse(participants$age < 20, "Under 20",
ifelse(participants$age < 30, "20-29",
ifelse(participants$age < 40, "30-39",
ifelse(participants$age < 50, "40-49", "Above 50")))),
levels = c("Under 20", "20-29", "30-39", "40-49", "Above 50"),
ordered= TRUE)
#reformat education level to Ordinal
participants_new$educationLevel <- factor(participants$educationLevel,
levels = c("Low", "HighSchoolOrCollege",
"Bachelors", "Graduate"
),
ordered = TRUE)
#round up joviality to 2 decimal places
participants_new$joviality <- round(participants$joviality, 2)
#output data frame
participants_new# A tibble: 1,011 × 9
participantId householdSize haveKids age educationLevel interestGroup
<chr> <ord> <lgl> <dbl> <ord> <chr>
1 0 3 TRUE 36 HighSchoolOrCollege H
2 1 3 TRUE 25 HighSchoolOrCollege B
3 2 3 TRUE 35 HighSchoolOrCollege A
4 3 3 TRUE 21 HighSchoolOrCollege I
5 4 3 TRUE 43 Bachelors H
6 5 3 TRUE 32 HighSchoolOrCollege D
7 6 3 TRUE 26 HighSchoolOrCollege I
8 7 3 TRUE 27 Bachelors A
9 8 3 TRUE 20 Bachelors G
10 9 3 TRUE 35 Bachelors D
# ℹ 1,001 more rows
# ℹ 3 more variables: joviality <dbl>, joviality_bins <fct>, age_group <ord>
3.3.2 FinancialJourval.csv :
Similar issue as point 1 above.
Timestamp is a
<POSIX>variable. (Rectify by reformatting it to<chr>in year-mth format)As per the code below, there are duplicate entries in the financial journal. (Rectify by using the
distinct()function from[dplyr package])
Show the code
#check for duplicates
dup <- (nrow(financial_journal) - nrow(unique(financial_journal)))
#reformat output
dup_reformat <- format(dup, big.mark=",")
#print output
dup_reformat[1] "1,113"
- “Category” is not very useful. (Rectify by using
pivot_wider()function from[tidyr package]to transpose)
Show the code
#remove duplicate rows for all columns
financial_journal_lessdup <- financial_journal %>%
distinct()
#create new dataset
grouped_data <- financial_journal_lessdup %>%
#recode participantId from dbl to chr, format timestamp to year_mth and round amount
mutate(participantId = as.character(participantId),
year_mth = format(as.Date(financial_journal_lessdup$timestamp), "%Y-%m"),
amount = abs(round(amount,2)),
.before = 3) %>%
#group the columns in the following order and sum the amount to total_amount
group_by( participantId, year_mth,category) %>%
summarize(total_amount = sum(amount))
# pivot the data frame to have categories as columns
pivoted_fj <- grouped_data %>%
pivot_wider(names_from = "category",
values_from = "total_amount", values_fill = 0)
# create new column from list of categories
pivoted_fj$Shelter_new <- pivoted_fj$Shelter + pivoted_fj$RentAdjustment
pivoted_fj$Expenses <- pivoted_fj$Education + pivoted_fj$Food +
pivoted_fj$Recreation + pivoted_fj$Shelter + pivoted_fj$RentAdjustment
pivoted_fj$Income <- pivoted_fj$Wage
pivoted_fj$Cashflow <- pivoted_fj$Income - pivoted_fj$Expenses
# output the pivoted data frame
pivoted_fj %>%
select(c(1:5,7,9:12))# A tibble: 10,691 × 10
# Groups: participantId, year_mth [10,691]
participantId year_mth Education Food Recreation Wage Shelter_new Expenses
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 2022-03 38.0 268. 349. 11932. 555. 1210.
2 0 2022-04 38.0 266. 219. 8637. 555. 1078.
3 0 2022-05 38.0 265. 383. 9048. 555. 1241.
4 0 2022-06 38.0 257. 466. 9048. 555. 1316.
5 0 2022-07 38.0 270. 1069. 8637. 555. 1933.
6 0 2022-08 38.0 262. 314. 9459. 555. 1169.
7 0 2022-09 38.0 256. 295. 9048. 555. 1144.
8 0 2022-10 38.0 267. 25.0 8637. 555. 885.
9 0 2022-11 38.0 261 377. 9048. 555. 1231.
10 0 2022-12 38.0 266. 357. 9048. 555. 1216.
# ℹ 10,681 more rows
# ℹ 2 more variables: Income <dbl>, Cashflow <dbl>
Type of Income/Expenses are all labelled in a column. (Create new columns)
Data Table Variables Name Description resident_profile_rev Income Category :Wage resident_profile_rev Expenses Category: Education + Recreation + Food + Shelter_new (Shelter + RentAdjustment) resident_profile_rev Cashflow NEW : Income - Expenses Multiple zero values for RentAdjustment. As seen from the data frame above and the code chunk below, there are only 72 rows. Thus, it will be combined with Shelter.
colSums(pivoted_fj[-1] !=0) year_mth Education Food Recreation Shelter
10691 3018 10691 9492 10560
Wage RentAdjustment Shelter_new Expenses Income
10691 72 10560 10691 10691
Cashflow
10691
3.4 Merging of Data frame
Full_join will be used to create a new table by joining the cleaned participants file and pivoted financial journal. It will be match by participant’s ID. Likewise, the sequence in the dataset will be relocated to highlight several columns.
Show the code
#join both data sets
resident_profile <- full_join(participants_new, pivoted_fj,
by = c("participantId" = "participantId")) %>%
#relocate columns to the front (by importance)
relocate(year_mth, .after =participantId) %>%
relocate(Cashflow, .after = year_mth) %>%
relocate(age_group, .after = Cashflow) %>%
relocate(educationLevel, .after = age_group) %>%
relocate(Income, .after = haveKids) %>%
relocate(Expenses , .after = Income)
resident_profile %>%
select(c(1:18))# A tibble: 10,691 × 18
participantId year_mth Cashflow age_group educationLevel householdSize
<chr> <chr> <dbl> <ord> <ord> <ord>
1 0 2022-03 10722. 30-39 HighSchoolOrCollege 3
2 0 2022-04 7559. 30-39 HighSchoolOrCollege 3
3 0 2022-05 7808. 30-39 HighSchoolOrCollege 3
4 0 2022-06 7733. 30-39 HighSchoolOrCollege 3
5 0 2022-07 6704. 30-39 HighSchoolOrCollege 3
6 0 2022-08 8291. 30-39 HighSchoolOrCollege 3
7 0 2022-09 7904. 30-39 HighSchoolOrCollege 3
8 0 2022-10 7752. 30-39 HighSchoolOrCollege 3
9 0 2022-11 7817. 30-39 HighSchoolOrCollege 3
10 0 2022-12 7832. 30-39 HighSchoolOrCollege 3
# ℹ 10,681 more rows
# ℹ 12 more variables: haveKids <lgl>, Income <dbl>, Expenses <dbl>, age <dbl>,
# interestGroup <chr>, joviality <dbl>, joviality_bins <fct>,
# Education <dbl>, Food <dbl>, Recreation <dbl>, Shelter <dbl>, Wage <dbl>
3.4.1 Entries Check
To ensure data accuracy, the code chunk below checks the completeness of the data. Given that the data has a time period of one year, the code examines if the participants have entries for the entire time period.
Show the code
#check if participants_id have entries for the entire year
participant_counts <- resident_profile %>%
group_by(participantId) %>%
summarise(num_months = n_distinct(year_mth)) %>%
ungroup()
filtered_count <- participant_counts %>%
filter(num_months != 12) %>%
nrow()
filtered_count [1] 131
It has been observed that there are 131 participants who only have 1 entry on Mar 2022. To avoid inaccuracy, these group of participants will be excluded from the analysis.
3.4.2 Missing Values Check
Through the code chunk below, we confirmed that there are no missing values in resident_profile dataset.
#Check for missing values
any(is.na(resident_profile))[1] FALSE
3.4.3 Revised Resident’s Profile Dataset
Given that we have removed duplicates in section 3.3.2, removed entries in section 3.4.1 and observed no missing values in section 3.4.2, the resident’s profile data set have been revised. We will be using the knitr: kable() function to display the final dataset.
Show the code
#create a revised dataframe to exclude id that do not have entries for the time period
resident_profile_rev <- resident_profile %>%
group_by(participantId) %>%
mutate(num_months = n_distinct(year_mth)) %>%
ungroup() %>%
filter(num_months == 12) %>%
select(1:13,15:19)
#output for dataframe using knitr:: kable
kable(head(resident_profile_rev), "simple") | participantId | year_mth | Cashflow | age_group | educationLevel | householdSize | haveKids | Income | Expenses | age | interestGroup | joviality | joviality_bins | Food | Recreation | Shelter | Wage | RentAdjustment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2022-03 | 10722.01 | 30-39 | HighSchoolOrCollege | 3 | TRUE | 11931.95 | 1209.94 | 36 | H | 0 | (0,0.2] | 268.26 | 348.68 | 554.99 | 11931.95 | 0 |
| 0 | 2022-04 | 7558.67 | 30-39 | HighSchoolOrCollege | 3 | TRUE | 8636.88 | 1078.21 | 36 | H | 0 | (0,0.2] | 265.79 | 219.42 | 554.99 | 8636.88 | 0 |
| 0 | 2022-05 | 7807.63 | 30-39 | HighSchoolOrCollege | 3 | TRUE | 9048.16 | 1240.53 | 36 | H | 0 | (0,0.2] | 264.54 | 382.99 | 554.99 | 9048.16 | 0 |
| 0 | 2022-06 | 7732.59 | 30-39 | HighSchoolOrCollege | 3 | TRUE | 9048.16 | 1315.57 | 36 | H | 0 | (0,0.2] | 256.90 | 465.67 | 554.99 | 9048.16 | 0 |
| 0 | 2022-07 | 6704.27 | 30-39 | HighSchoolOrCollege | 3 | TRUE | 8636.88 | 1932.61 | 36 | H | 0 | (0,0.2] | 270.13 | 1069.48 | 554.99 | 8636.88 | 0 |
| 0 | 2022-08 | 8290.55 | 30-39 | HighSchoolOrCollege | 3 | TRUE | 9459.44 | 1168.89 | 36 | H | 0 | (0,0.2] | 261.76 | 314.13 | 554.99 | 9459.44 | 0 |
4. Exploratory Data Visualization
In this section, we will design plots with interactivity for users to study and explore the data. The plots are created with the use of giraph , plotly, and patchwork packages.
4.1 Interactive Dashboard
A dashboard is created to provide an overview of the demographics of residents in City of Engagement across age group. Bar chart is chosen to show segments of information by comparing different categorical variables. A design layout is included in the code to better visualized the output through patchwork. Moreover, tooltip is used to highlight the specific age group at the point of the data.
Show the code
#Create new df by doing a full join with participants_counts
participants_new_rev <- full_join(
participants_new, participant_counts,
by = c("participantId" = "participantId")) %>%
group_by(participantId) %>%
#participants_counts contains the number of months the participants entries are in
ungroup() %>%
#filter away participantsId who do not have entries for the full year
filter(num_months == 12) %>%
select(-num_months)
#create tooltip to display age group
participants_new_rev$tooltip <-c(paste0(
"Age Group:", participants_new_rev$age_group))
#Bar chart for resident's age distribution
p1 <- ggplot(data= participants_new_rev,
aes(x = age_group)) +
geom_bar_interactive(aes(tooltip = participants_new_rev$tooltip,
stackgroups = TRUE,
data_id= age_group)) +
scale_fill_manual(values = wes_palette("Chevalier1")) +
xlab("Age Group") +
ylab("No. of participants") +
theme(axis.text.x=element_text(size=5)) +
theme(axis.title.y=element_text(size=10)) +
ylim(0,250)
#Bar chart for resident's household size distribution
p2 <- ggplot(data= participants_new_rev,
aes(x = householdSize)) +
geom_bar_interactive(aes(tooltip = participants_new_rev$tooltip,
stackgroups = TRUE,
data_id= age_group)) +
scale_fill_manual(values = wes_palette("Chevalier1")) +
xlab("Size of Household") +
ylab("") +
theme(axis.text.x=element_text(size=5)) +
theme(axis.title.y=element_text(size=10)) +
ylim(0,350)
#Bar chart for resident's education level
p3 <- ggplot(data= participants_new_rev,
aes(x = educationLevel)) +
geom_bar_interactive(aes(tooltip = participants_new_rev$tooltip,
stackgroups = TRUE,
data_id= age_group)) +
xlab("Education Level") +
ylab("") +
theme(axis.text.x=element_text(size=5)) +
theme(axis.title.y=element_text(size=10)) +
ylim(0,350)
#Bar chart to visualize if residents have kids
p4 <- ggplot(data= participants_new_rev,
aes(x = haveKids)) +
geom_bar_interactive(aes(tooltip = participants_new_rev$tooltip,
stackgroups = TRUE,
data_id= age_group)) +
ylab("") +
theme(axis.text.x=element_text(size=5)) +
theme(axis.title.y=element_text(size=10)) +
ylim(0,400)
#Bar chart for residents' interest group
p5 <- ggplot(data= participants_new_rev,
aes(x = interestGroup)) +
geom_bar_interactive(aes(tooltip = participants_new_rev$tooltip,
stackgroups = TRUE,
data_id= age_group)) +
xlab("Interest Group") +
ylab("No. of participants") +
theme(axis.text.x=element_text(size=5)) +
theme(axis.title.y=element_text(size=10)) +
ylim(0,120)
#Bar chart for residents' joviality in bins
p6 <- ggplot(data= participants_new_rev,
aes(x = joviality_bins)) +
geom_bar_interactive(aes(tooltip = participants_new_rev$tooltip,
stackgroups = TRUE,
data_id= age_group)) +
xlab("Joviality") +
ylab("No. of participants") +
theme(axis.text.x=element_text(size=5)) +
theme(axis.title.y=element_text(size=10)) +
ylim(0,250)
#design layout for the patchwork figure
design <- "
132
132
554
554
666
666
"
girafe(code = print(p1 + p2 + p3 + p4 + p5 + p6 +
plot_layout(design = design,) +
plot_annotation(title =
"Demographics Insights of residents in City of Engagement",
theme = theme(plot.title = element_text(size = 20, hjust=0.5))
)),
width_svg = 12,
height_svg = 6,
options = list(
opts_hover(css = "fill: #02401B;"),
opts_hover_inv(css = "opacity:0.2;")
)
) Observations:
- The city is facing an aging population as age group is left-skewed.
- Low interest for future education as education level is right-skewed.
- Small family size with higher proportion of participants of not having Kids.
- Uniformly distributed Interest Group. No preference among age group.
- Joviality level seems to be decreasing at a decreasing rate
Click on the graph and hover around each demographics.
The respective age group will be displayed.
4.2 Financial Health of Participants
To know more about the financial health of the participants, interactive geom_point is used to plot against the time period. tooltip is included to create a snapshot of the financial health status of the participants at the time period. In addition, the plot contains hover effect with the use of data_id aesthetic to highlight the trend of the participant’s cash flow.
Show the code
#tooltip output to display ID, Cashflow, Income, and Expenses
resident_profile_rev$tooltip <- paste0(
"Participant's ID = ", resident_profile_rev$participantId,
"\n Cashflow = ", format(resident_profile_rev$Cashflow, big.mark = ","),
"\n Income = ", format(resident_profile_rev$Income, big.mark = ","),
"\n Expenses = ", format(resident_profile_rev$Expenses, big.mark = ",")
)
#tool_tip design
tooltip_css <- "background-color: lightgrey; #<<
font-style:bold; color: #446455;" #<<
ie <- ggplot(data=resident_profile_rev) +
geom_point_interactive(aes(x = year_mth, y = Cashflow,
tooltip = resident_profile_rev$tooltip,
data_id = participantId,
#if Cashflow >0 = Green, else Red
color = ifelse(Cashflow >= 0,
"Above 0", "Below 0")
)) +
scale_color_manual(values = c("Above 0" = "#446455",
"Below 0" = "#C93312")) +
#remove legend title
labs(color = "") +
labs(title="Financial Health of Participants from Mar 2022 to Feb 2023") +
ylab("Cashflow ($)") + xlab("Year-month") +
scale_y_continuous(labels = comma_format()) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
theme_minimal() |>
girafe(
ggobj = ie,
width_svg = 6,
height_svg = 6*0.618,
options = list( #<<
opts_tooltip(css = tooltip_css), #<<
opts_hover_inv(css = "opacity:0.1;") #<<
)
) Observations:
- Majority of participants have positive cash flow (Good Credit Rating)
- Negligible growth on cash flow
- No drastic changes in participant’s cash flow over time
4.3 The Circulation of Money
As illustrated earlier, the cash flow is a holding transaction. It is holded by the individual after repaying the necessary debt/expenses. To build more wealth, it is ideal for Wage to increase or for expenses to decrease.
With that in mind, a line scatter pot is used to identify the trend in various categories over a time period. The participants dataset have been revised to exclude participants who did not fulfill the entries check. It has been reformatted before executing a full_join with participants_count data set. The joined table concatenates the count of the year-mth period (num_months) that is not available in the dataset and filters it away.
The plot_ly() graph below shows the total amount circulated by the participants.
Show the code
#to reformat partcipantId, create year-mth, round off amount to 2.d.p
financial_journal_lessdup <- financial_journal_lessdup %>%
mutate(participantId = as.character(participantId),
year_mth = format(as.Date(financial_journal_lessdup$timestamp), "%Y-%m",
amount = abs(round(amount,2)),
.before = 1))
#Create new df by doing a full join with participants_counts
financial_journal_lessdup_lessentries <- full_join(
financial_journal_lessdup, participant_counts,
by = c("participantId" = "participantId")) %>%
group_by(participantId) %>%
#participants_counts contains the number of months the participants entries are in
mutate(num_months = n_distinct(year_mth)) %>%
ungroup() %>%
#filter away participantsId who do not have entries for the full year
filter(num_months == 12) %>%
select(participantId, year_mth, category,amount)
#create another df to group by year_mth, category
grouped_data_rev <- financial_journal_lessdup_lessentries %>%
#group the columns in the following order
group_by(year_mth, category) %>%
summarize(total_amount = sum(amount))
#creating interactive graph
plot_ly(data = grouped_data_rev,
x = ~year_mth, y = ~total_amount, color = ~category,
type = 'scatter', mode = 'line',
hovertemplate = ~paste("Year-Month:", year_mth,
"<br>Amount:", format(total_amount, big.mark = ","))) |>
#Configure title and axes
layout(title = "The Circulation of Money",
xaxis = list(title = "Time Period"),
yaxis = list(title = "Amount"))Observations:
- Income (~Wage) is fairly constant except for Mar 2022. As Income varies with overtime, it can be inferred that the abnormally could be a form of bonus, grant, incentives given by the service centre
- Expenses distribution seems to be constant with lesser spending on Education expenses . However, there is a slight fluctuation in Recreation expenses.
4.3.1 Spending Patterns of participants
As observed earlier, there are a slight fluctuation in Recreation expenses. Henceforth, Wage will be removed in the graph to better visualize the expenses.
Plot_ly graph is plotted to visualize the spending patterns.
Show the code
#create a new dataset
grouped_data_rev_new <- financial_journal_lessdup_lessentries %>%
mutate(amount = abs(round(amount,2))) %>%
#group the columns in the following order
group_by(category,year_mth) %>%
summarize(total_amount = sum(amount))
# Filter out "Wage" category from the data frame
grouped_data_rev_newest <- grouped_data_rev_new %>%
filter(category != "Wage")
#creating interactive graph
plot_ly(data = grouped_data_rev_newest,
x = ~year_mth, y = ~total_amount, color = ~category,
type = 'scatter', mode = 'line',
hovertemplate = ~paste("Year-Month:", year_mth,
"<br>Amount:", format(total_amount, big.mark = ","))) |>
#Configure title and axes
layout(title = "Total Expenses incurred by partcipants \nin City of Engagement from Mar 2022 - Feb 2023",
xaxis = list(title = "Time Period"),
yaxis = list(title = "Total Expenses"))Observations:
- Education remain constant throughout while Shelter decrease in Mar 2022 and remain constant. As identified, both expenses are a fixed expense.
- Recreation expenses fluctuates more in comparison to Food expenses.
5. Confirmatory Data Analysis Visualization
In this section, we will focus more on the statistical testing that are used in Confirmatory Data analysis. The plots are created with the use of ggbarstats , ggbetweenstats, and gscatterstats packages.
5.1 Association Test between Age group and Education Level
As observed in Section 4.1 - Interactive Dashboard, we noticed that the age-group is left-skewed while the education level is right-skewed. Therefore, we would like to test if there is any association between the two variables. Notably, the association test is non-parametric and thus, does not have to conform to the normality assumption.
At 95% confidence level,
Ho : No association exists between the age group and education level
H1: Association exists between the age group and education level
Show the code
ggbarstats(data = resident_profile_rev,
x = educationLevel, y = age_group,
xlab= "Age Group", ylab = "Education Level",
title = "Comparison of Education level across Age Group",
type = "nonparametric", conf.level = 0.95,
package = "wesanderson", palette = "Chevalier1"
)
From the graph, we observed that majority of the age group have a HighSchoolOrCollege education level whereas the interest for further study decreases as the age increase. Interestingly, young adults aged 20-29 have attained a higher education level than other age group.
From the test result above (p<0.05) , we conclude that there is an association between the age group and education level as we reject the null hypothesis.
5.2 Differences in Joviality based on Education Level
As defined earlier, Joviality indicates the participant’s overall happiness at the start of the study. We will like to found out if there is a difference in Joviality based on Education Level. Before testing our hypothesis, we will perform a normality assumption test at 95% confidence level.
4.4.1 Normality Assumption Test
Normal Quantile Plot (QQ Plot). For a normally distributed data values, the dot points should be with the 95% envelope or scatter very closely to the slope line in between the two envelopes lines. From the data visualization above, we can visually confirmed that the observed values failed to conform to the normality assumption.
At 95% confidence level:
Ho: the observed distribution resembles normal distribution
H1: the observed distribution failed to resemble normal distribution
We failed to reject the Ho if p-value is above the alpha value of 0.05.
We reject Ho if p-value is <0.05.
Given that normality is not assumed, we will use the non-parametric test such as Wilcoxon test statistics.
Show the code
qq_plot <- ggplot(data=resident_profile_rev,
aes(sample = joviality)) +
stat_qq() +
stat_qq_line()ggplot(data=resident_profile_rev,
aes(sample = joviality)) +
stat_qq() +
stat_qq_line()
# ggplot(data = resident_profile_rev,
# aes(x= joviality)) +
# geom_histogram(bins=10) +
# xlab("Joviality") +
# ylab("No.of Residents") +
# geom_vline(aes(xintercept = average_joviality), col ="red", linewidth=1 ) +
# annotate("text", x=0.37, y= 1400, label="Average Joviality:", size=4, color = "red") + annotate("text", x=0.37, y= 1300, label=format(average_joviality, big.mark=","), size=4, color = "red")
# geom_vline(aes(xintercept = median_joviality), col ="green", linewidth=1 ) Based on the result above, we concluded that there is enough statistical evidence to reject the null hypothesis. Since the p-values fall below (p < 0.05), normality is not assumed. Henceforth, we will use the Kruskal-Wallis test.
4.4.2 Kruskal-Wallis Test for Joviality across Education Level
We will test the following hypothesis at 95% Confidence Level:
Ho : the median Joviality across different education level is the same
H1: the median Joviality across different education level is not the same
Show the code
ggbetweenstats(data = resident_profile_rev,
x= educationLevel, y= joviality, type ="np",
xlab= "Education Level", ylab = "Joviality",
title = "Comparison of Joviality across Education Level",
pairwise.comparisons = TRUE, pairwise.display ="ns", conf.level = 0.95,
package = "wesanderson", palette = "Chevalier1"
)
As seen above, the P-value is lower than the 0.05. As such, there is enough statistical evidence to reject the null hypothesis that the median joviality across education level is the same.
Additionally, we want to find out if there any distinct similarities between the district. Through the graph above, we discovered that not all pair comparison are statistically significant. The pair (Low and HighSchoolOrCollege) is not statistically significant with a P-value of 0.14, which is greater than 0.05. Thus, we cannot reject the null hypothesis that there is not differences between the joviality level between the pair.
4.5 Association between Age group and Joviality
We discovered that there is a statistical difference in joviality across Education Level, we would like to examine if the same applies and if there is an association between the age group. As Joviality is a continuous variable, we used mutate from dplyr to split it into 5-class variables, with each range at 20%.
At 95% confidence level,
Ho : No association exists between the age group and joviality level
H1: Association exists between the age group and joviality level
ggbarstats(data = resident_profile_rev,
x = joviality_bins, y = age_group,
title = "Comparison of Joviality across age-group",
pairwise.comparisons = TRUE, pairwise.display ="s", conf.level = 0.95,
package = "wesanderson", palette = "Chevalier1"
)
5.2 Differences in joviality across income
We want to find out if there are any differences in joviality across income.
The following considerations have been made:
assume income increases over the years
opacity included to highlight the contrast
Tooltip to indicate variables
participantID
Income
Cashflow
We will be using plot_ly for the interactive graph.
plot_ly(data = resident_profile_rev,
x = ~joviality, y = ~Income,
hovertemplate = ~paste("<br>Participant's ID:",participantId,
"<br>Cashflow:", Cashflow,
"<br>Income:", Income,
"<br>Expenses:", Expenses),
type = "scatter",
mode = "markers",
marker = list(opacity = 0.7,sizemode = "diameter",
line = list(width =0.1, color = "white"))) |>
#add title and labels to axis
layout(title = "Interactive scatterplot of Income vs Joviality" ,
xaxis = list(title = "Joviality level") ,
yaxis = list(title = "Income"))Observations:
When income exceeds $15k, joviality level decreases
it is similar to income range of $10-$15k where majority have low joviality
Joviality level are more spread out when income range below $5k
Happiness level of joviality ranges from 0 to 1. A score of 0.4 indicates that participant is dull while a score of 0.8 indicates joy.
DT::datatable(resident_profile_rev, class= "compact")d <- highlight_key(resident_profile_rev)
p <- ggplot(d,
aes(Expenses,
Income)) +
geom_point(size =1)
gg <- highlight(ggplotly(p),
"plotly_selected")
dt <- DT::datatable(d,
selection =list(mode="single",
target="column"))
crosstalk::bscols(gg,
dt,
widths = 6) fj_cat <- financial_journal_lessdup %>%
#recode ID from dbl to chr, year_mth
mutate(participantId = as.character(participantId),
year = format(as.Date(financial_journal_lessdup$timestamp), "%Y"),
mth = format(as.Date(financial_journal_lessdup$timestamp), "%m"),
amount = abs(round(amount,2)),
.before = 3) %>%
#group the columns in the following order
group_by(year, mth, category) %>%
summarize(total_amount = sum(amount))
fj_cat# A tibble: 62 × 4
# Groups: year, mth [12]
year mth category total_amount
<chr> <chr> <chr> <dbl>
1 2022 03 Education 14354.
2 2022 03 Food 327829.
3 2022 03 Recreation 649580.
4 2022 03 RentAdjustment 53504.
5 2022 03 Shelter 631623.
6 2022 03 Wage 6402720.
7 2022 04 Education 11424.
8 2022 04 Food 304282.
9 2022 04 Recreation 389688.
10 2022 04 RentAdjustment 1429.
# ℹ 52 more rows
ggplot(data = fj_cat,
aes(x= category, y=total_amount, size = total_amount)) +
geom_point(alpha=0.8, show.legend= FALSE) +
scale_size(range =c(2,12)) +
labs(titles = "Year{as.integer(frame_time)}",
x = "Category",
y = "Total_amount") +
transition_time(as.integer(year)) +
ease_aes("linear") +
scale_y_continuous(labels = comma_format()) 
5.3 Average Cashflow available for Residents across Age Group
Show the code
tooltip_css <- "background-color:#C7B19C; #<<
font-style:bold; color:#446455;" #<<
tooltip <- function(y, ymax, accuracy = 1) { #<<
mean <- scales::number(y, accuracy = accuracy) #<<
sem <- scales::number(ymax - y, accuracy = accuracy) #<<
paste("Average Cashflow:", mean, "+/-", sem) #<<
} #<<
pp <- ggplot(data=resident_profile_rev,
aes(x = age_group),
) +
stat_summary(aes(y = Cashflow,
tooltip = after_stat( #<<
tooltip(y, ymax))), #<<
fun.data = "mean_se",
geom = GeomInteractiveCol, #<<
fill = "#D3DDDC"
) +
stat_summary(aes(y = Cashflow),
fun.data = mean_se,
geom = "errorbar", width = 0.2, linewidth = 0.2
) +
labs(title="Average Cashflow available to Residents by age_group") +
ylab("Total Cashflow") +
xlab("Age Group")
girafe(ggobj = pp,
width_svg = 8,
height_svg = 8*0.618,
options = list( #<<
opts_tooltip( #<<
css = tooltip_css)) #<<
)ggplot(data =resident_profile_rev,
aes(x = Food,
y = age_group)) +
geom_density_ridges(
scale = 3,
rel_min_height = 0.01,
bandwidth = 3.4,
fill = "#D3DDDC",
color = "#446455"
) +
scale_x_continuous(
name = "English grades",
expand = c(0, 0)
) +
scale_y_discrete(name = NULL, expand = expansion(add = c(0.2, 2.6))) +
theme_ridges()
ggscatterstats(
data = resident_profile_rev,
x = Expenses,
y = Shelter,
type = "nonparametric",
marginal = TRUE,
title = "Significant Test of Correlation between Expenses and Shelter",
xlab = "Expenses",
ylab = "Shelter"
)
#plotting correlation between age and income
ggscatterstats(data = resident_profile_rev,
x = age, y = Income,
type = "nonparametric",
marginal = TRUE,
title = "Significant Test of Correlation between Age and Income",
xlab = "Age",
ylab = "Income"
)
Cashflow / Income
Show the code
#plotting correlation between age and income across education level
edu_low <- ggscatterstats(data = resident_profile_rev |>
filter(educationLevel == "Low"),
x = Income, y = Cashflow,
type = "nonparametric") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
labs(title = "Low",
x = "Income", y = "Cashflow") +
scale_y_continuous(labels = comma_format())
edu_hc <- ggscatterstats(data = resident_profile_rev |>
filter(educationLevel == "HighSchoolOrCollege"),
x = Income, y = Cashflow,
type = "nonparametric") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
labs(title = "High School",
x = "Income", y = "Cashflow") +
scale_y_continuous(labels = comma_format())
edu_bach <- ggscatterstats(data = resident_profile_rev |>
filter(educationLevel == "Bachelors"),
x = Income, y = Cashflow,
type = "nonparametric") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
labs(title = "Bachelors",
x = "Income", y = "Cashflow") +
scale_y_continuous(labels = comma_format())
edu_grad <- ggscatterstats(data = resident_profile_rev |>
filter(educationLevel == "Graduate"),
x = Income, y = Cashflow,
type = "nonparametric") +
theme(axis.text.x = element_text(angle = 45, vjust = 1, hjust=1)) +
labs(title = "Graduate",
x = "Income", y = "Cashflow") +
scale_y_continuous(labels = comma_format())
#combined plot and ensure layout is in order
corr_edu <- ((edu_low + edu_hc) / (edu_bach + edu_grad) + plot_spacer())
#add labels
corr_edu + plot_annotation(title = "Correlation between Cashflow and Income",
subtitle = "High correlation between Cashflow and Income at all education level",
theme = theme(
plot.title = element_text(size = 18),
plot.subtitle = element_text(size = 12)))
Observations:
Almost perfect positive linear relationship between Cashflow and Income (>0.9)
As education level increases, the Sprearman correlation coefficient increases